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Abstract: Consider estimating the n by p matrix of means of an n by 
p matrix of independent normally distributed observations with constant 
variance, where the performance of an estimator is judged using a p by p 
matrix quadratic error loss function. A matrix version of the James-Stein 
estimator is proposed, depending on a tuning constant. It is shown to 
dominate the usual maximum likelihood estimator for some choices of of 
the tuning constant when n is greater than or equal to 3. This result also 
extends to other shrinkage estimators and settings. 
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1. Introduction 

Shrinkage estimators are usually set in the context of vector data. If x{n x 1) 
is a random vector with mean 6, then a shrinkage estimator of 6 takes the 
form 

a = x - ag(x;u) (1.1) 

where a > is a tuning parameter, and g(x, u) (nxl) is a "shrinkage function" , 
depending on the data x, and possibly on extra information in an auxiliary 
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random variable (or random vector) u. Let F(x, u) denote the joint distribution 
of x and it, depending on 0. The classic James-Stein estimator (Stein, 1956; 
James and Stein, 1961) is a special case in the setting x ~ N n (0, <J 2 I n ), n > 3. 
When a 2 is known, the shrinkage function is given by 

g{x) = a 2 {n-2)x/\\x\\ 2 . (1.2) 

When a 2 is unknown, the shrinkage function is given by 

g(x) = {u/(u + 2)}(n-2)x/\\x\\ 2 , (1.3) 

where u ~ a 2 xl is an auxiliary random variable independent of x which is 
used to estimate a 2 . 

The objective in shrinkage estimation is to estimate the vector parameter 0, 
where the performance of an estimator = 0(x) is judged by the scalar loss 
function 

n 

L scalar (O,0) = J2&- e *) 2 (1-4) 

i=i 

and associated risk function -R sca iar(0, 0) = E F {L sca i SiT (0, #)}• 

In order to guarantee that the shrinkage estimator dominates the simple 
unbiased estimator 6 = x, the usual strategy is to demonstrate the "cross- 
product inequality" 

E F {(x - Gfg} > E F (g T g) > 0, (1.5) 

for all 0, where g = g(x,u) is a function of the random vector x (and of 
u when present). The last inequality has been included to ensure that g is 
nontrivial. Throughout the paper we assume that x and g(x,u) have finite 
second moments. Then the following well-known result holds. 

Theorem 1. Let x(n x 1) be a random vector and u be an auxiliary ran- 
dom variable such that E(x) = under a probability model F depending on 
0. Also suppose there exists a shrinkage function g = g(x,u) such that the 
cross-product inequality (1.5) holds. Then the shrinkage estimator a in (1.1) 
dominates the simple estimator 0q = x under the scalar loss function (1.4) 
provided the tuning parameter a satisfies < a < 2. 

Proof. Write 5 = E F {(x - 0) T g} and 7 = E F (g T g), so 5 > 7 > 0. Then the 
risk takes the form 

i?scalar(£a, 6) = E F {(x - - agf \x - - ag)} 

= E F {(x - 0) T (x - 0)} - 2a5 + a 2 7 

< E F {(x - 0) T (x - 0)} - 2a 7 + a 2 7 

< E F {(x - 0f(x - 0)} = R sca UOo, 0) 
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provided < a < 2. □ 

For the James-Stein estimator, Stein's Lemma (Stein, 1981) states that the 
cross-product inequality for known a 2 holds for n > 3 and is actually an 
equality. That is, if x ~ N n (0, a 2 I n ), n > 3, then 

a 2 E [{x T (x - 6>)/||a;|| 2 } = (n - 2)a 4 E {l/\\x\\ 2 } = o 2 A, say, (1.7) 

where A = A(X 2 ) depends on A 2 = 9 T 0/a 2 and < A < oo. Stein's Lemma 
can be proved using integration by parts (e.g. Efron and Morris (1976) or Stein 
(1981)). An equality also holds in the analogue of (1.7) for the unknown a 2 
case since E[u) = ua 2 , E(u 2 ) = v[y + 2)a 2 in (1.3). Hence Theorem 1 for 
the James-Stein estimator, in both the known and unknown a 2 cases, can 
be strengthened to conclude that the optimal value of the tuning constant is 
a = 1, uniformly over all 0. 

The purpose of this paper is to extend James-Stein and other shrinkage 
estimators to a matrix setting where the data take the form of an n x p matrix 
X with mean E(X) = O. The objective is to estimate using a p x p matrix 
quadratic loss function, 

£matrix(e, 6) = {6 - 6} T {6 - 9}, (1.8) 

and associated risk function -R m atrix(0, ©) = E{L matrix (Q, 9)}. Note that this 
loss function pools errors across the n rows, but treats the p columns separately. 
Hence we look for an estimator which shrinks across the n rows, but does not 
shrink across columns. 

Other authors have considered the use of shrinkage methods in a matrix 
setting; see, e.g. Efron and Morris (1972, 1976); Haff (1977); Zheng (1986); 
Ghosh and Shieh (1991); Tsukuma and Kubokawa (2007); Tsukuma (2009). 
However, these papers use a scalar squared error loss function and so are not 
directly relevant here. There seems to be little work focused on a matrix loss 
function. 

Section 2 states the main result in the matrix setting, with some discussion 
given in Section 3. 

2. Matrix data 

Suppose the data take the form of an n x p matrix X, plus auxiliary random 
variables u = (ui, . . . , u p ) T , when present. Let = Ep(X) denote the n x p 
matrix of means, where F denotes the joint distribution of X and u. The 
objective is to estimate under the pxp matrix quadratic loss function (1.8). 
Let X(j) denote the jth column of X. 
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Suppose that for each column j = 1, . . . , p, there is a shrinkage function 
9(j) = 9(j)( x (j)i u j)- A natural estimator is the "diagonal shrinkage estimator", 
defined by applying the vector shrinkage estimator separately to each column 
of X. That is, define a = O a (X) in terms of its columns a ,(j) by 

0a,(j) = Oa{X(j),Uj) (2.1) 

using (1.1). Note that the shrinkage applied to each column does not depend on 
the data in other columns. We use the term "diagonal" because in the setting 
(1.2) the estimator can also be written in matrix form using a diagonal matrix, 

Q a = XD, D = diag(dj), dj = 1 — aa 2 (n — 2)/\\x^)\\ 2 , j = l,...,p. 

Given two estimators 0^ and depending on X, say that strictly 
dominates 6^ if -R ma trix(@ (1) , 0) < ^matrix (0 (2) , ©) for all 0, where " <" means 
that the difference between the right- and left-hand sides is a positive-definite 
matrix. The following theorem is the main result of this paper. 

Theorem 2. Let X{n x p) be a random matrix and u = (m, . . . ,u p ) T be a 
vector of auxiliary random variables such that Ep(X) = under a probability 
model F depending on 0, and the data {xu\,Uj} are independent for different 
j. Suppose there exist shrinkage functions = g^(xy),Uj) such that the 
cross-product inequality (1.5) holds for each j = 1, . . . ,p. Then the shrinkage 
estimator Q a in (2.1) dominates the simple estimator ©o = X under the 
matrix loss function (1.8) provided the tuning parameter a satisfies < a < 
2/p. 

Proof. The proof makes use of the following inequality, where o; is a p x 1 
vector and G is an n x p matrix with columns g^, j = 1, . . . ,p, 

v v 

a j a k9%)9{ k ) < N H2iH ll3(*)H 

j,k=l j,k=l 

■ p \ 2 



V 



J'=l 

The two inequalities follow from two versions of the Cauchy-Schwarz inequal- 
ity. 

To show a dominates 0o = X for a particular choice of a, we need to show 
that 

#matrix(0a, ©) < #matrix(0O, ©) for all 0. 
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Equivalently we need to show that 

ol t Ra < nor I n cx, = n for all 6, (2.3) 

where R = -R m atrix(@a, @) and ex is an arbitrary standardized p- dimensional 
vector, ck t q; = 1. 

The left-hand side of (2.3) can be written as 

v 



( t ~\ 

®j®kE I (o a>{j) - e u) j (o a> (k) - 0(k)) \ 
i,k=i < > 
p 

= a i a kE [{ (x {j) - 6 {j) ) - ag {j) } T { (x (k) - {k) ) - ag {k) ) 

j,k=i 

p p 
= a ) [ E { ( x ti) ~ °U)f ( X U) ~ 9 U)) } " 2aS i\ + a2 J2 a i akE (9l)9{ k )) 

j=l j,k=l 
P 

^ a i [ E { ( x (j) ~ °^)) T ( x 0) ~ } " 2a5 i + a2 P^ 
p 



< cx T RqCx = n, 



(2.4) 



for < a < 2/p, where Sj = E F {(x (j) - {j) ) T g {j) } and ^ = E F (g T {j) g {j) ), 
so 5j > 7j > 0. In going from the second to the third line of (2.4) notice 
that many of the off-diagonal terms vanish because the different columns are 
independent and E(xy) = 0. The fourth line follows from the third line 

by the Cauchy-Schwarz based inequality (2.2). The last line follows from the 
fifth line by simple properties of quadratic functions. □ 

Comments 

(a) The allowable interval for a decreases with p. This property is related to 
the result that for a matrix loss function, it is harder to dominate the 
maximum likelihood estimator than for a scalar loss function. 

(b) For the James-Stein case, the p-dimensional result is less powerful than 
the one-dimensional result. In one dimension a = 1 is optimal; 6\ domi- 
nates 6 a for all other choices of a. In contrast, if p > 1 there is no single 
choice of a for a which dominates all other choices. 

(c) Further, at least for the James-Stein case, the interval (0, 2/p) is the best 
possible interval for a. If a < or a > 2/p, it is possible to find values of 
such that B a does not dominate Gq. 
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Here is a simple construction in the case of known variance a 2 = 1. 
Recall Xij ~ N(9ij, 1) independently for i = 1, . . . ,n, j = 1, . . . ,p. Let 
a.j = 1/a/p, j = 1, . . . ,p. Let 0* be a n- vector of unit size, 6* T 6* = 1, 
and suppose all of the columns of 6 are equal to the same multiple of 
0*, 9fj\ = n0*. For large k it is straightforward to show that 

Si = lj = E(gl m ) = K/k 2 ) + 0(1/ k 4 ) 

for all j, where m = n — 2. Further (2.2) becomes an equality in this 
setting so that the risk in (2.3) reduces to 

ct T Rct = n- 2a{m 2 /K 2 ) + a 2 (m 2 /^ 2 )p + 0(1/ k 4 ). (2.5) 

Ignoring the remainder term, the quadratic function of a in (2.5) is less 
than n for < a < 2/p and exceeds n for a < or a > 2 /p. Hence for 
any specific choice of a < or a > 2/p, cx T Rcx > n for sufficiently large 

K. 

The same argument works for the case of unknown a 2 . 

(d) In the vector case, if the shrinkage function g is re-scaled to eg for some 
constant c > 0, then the cross-product inequality needs minor adjustment 
and the allowable interval for the tuning parameter a changes from (0, 2) 
to (0, 2/c). The scaling convention for the cross-product inequality chosen 
in this paper has been made to make the treatment of different columns 
as consistent as possible in the extension to the matrix case. 

(e) Efron and Morris (1972) proposed the "matrix" James-Stein estimator 

6 MJS = X{I P - (n-p-l)S- 1 }, S = X T X, 

and investigated its properties under the scalar loss function (1.4). How- 
ever, its properties under the matrix loss function (1.8) are unknown. 

3. Discussion 

For the classic vector James-Stein estimator there are several ingredients in 
the formulation of the problem and the estimator such as the following: (a) 
normality of the data, (b) uncorrelated components, (c) the specific choice 
(1.2) for the shrinkage function g, and (d) the assumption that the range of 
possible values for spans all of M. n . 

Each of these ingredients can be relaxed, either individually or in combina- 
tion. Here are some examples. 
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(a) relax normality to (i) more general spherical distributions 
(Brandwein and Strawderman, 1991; Cellier and Fourdrinier, 1995) or 
(ii) independent components (Shinozaki, 1984); 

(b) allow correlated normal or more general elliptic distributions 
(Fourdrinier, Strawderman and Wells, 2006); 

(c) use other shrinkage estimators such as (i) subspace shrinkage, or more 
generally (ii) Bayes or generalized Bayes estimators based on superhar- 
monic prior distributions (Stein, 1981); 

(d) relax the range of possible values for 6 from all of M n to a specified cone 
(Fourdrinier, Strawderman and Wells, 2006). 

In each case the improved performance of the shrinkage estimator is justified 
by a version of the cross-product inequality. Hence in each case there is an 
immediate extension to the matrix case. 

Another direction in which the paper might be extended is to allow depen- 
dence between the columns. At least in the normal case with a known p x p 
covariance matrix S, it is possible to adapt the results of this paper. 

Thus let X(n x p) follow an rap- dimensional normal distribution with mean 
E(X) = 0, with independent rows and with common covariance matrix £ 
within each row. Let A be a matrix square root of S -1 , so that AA T = S^ 1 . 
Then Y = XA has independent columns. Hence the methodology of Section 2 
can be applied to Y to yield an estimator $ a of $ = OA Back-transforming 
yields an estimator a = I^A" 1 which dominates 0o in the matrix sense (1.8), 
provided < a < 2. 

It is not clear to what extent these results carry over when £ needs to be 
estimated. Further, note that A is only defined up to a multiplication on the 
left by a p x p orthogonal matrix. Thus the methodology of Section 2 defines 
a whole family of estimators, each with the same statistical properties. It is 
not clear whether it might be possible to combine them in some way to yield 
a superior estimator. 
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